
Journal of the American Medical Informatics Association

53 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
Show Your Work: Verbatim Evidence Requirements and Automated Assessment for Large Language Models in Biomedical Text Processing
2026-03-04 health informatics 10.64898/2026.03.03.26346690
#1 (23.7%)

Purpose: Large language models (LLMs) are used for biomedical text processing, but individual decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable "show your work" quote affects accuracy, stability, and verifiability for trial eligibility-scope classification from abstracts. Methods: We used 200 oncology randomized controlled trials (2005 - 2023) and provided models with only the title and abstract. Trials were labeled with whether they allowed for the inclusio...

2
Fully Automated Systematic Review Generation via Large Language Models: Quality Assessment and Implications for Scientific Publishing
2026-02-23 health informatics 10.64898/2026.02.18.26346559
#1 (23.2%)

Large language models (LLMs) are increasingly transforming scientific workflows, yet their application to rigorous evidence synthesis remains underexplored. Through the execution of a single Python script, we present a fully automated pipeline leveraging the Claude API to generate systematic reviews from literature search through manuscript completion without human intervention. Our pipeline processes hundreds of papers through iterative API calls for inclusion evaluation, information extraction...

3
Development and validation of an algorithm to identify front-line clinicians using EHR audit log data
2026-02-16 health informatics 10.64898/2026.02.13.26346268
#1 (23.1%)

Background: Interprofessional teams are central to high quality patient care. However, identifying the clinician primarily responsible for a patient requires labor-intensive methodologies. Although electronic health record (EHR) audit logs offer a scalable alternative, their use for identifying frontline clinicians is underdeveloped. Objective: To develop and validate an algorithm utilizing EHR audit logs to identify the primary frontline clinician per patient day of an encounter and to describe care...

4
Evaluating a Locally Deployed 20-Billion Parameter Large Language Model for Automated Abstract Screening in Systematic Reviews
2026-03-04 health informatics 10.64898/2026.03.04.26347506
#1 (22.2%)

Background: Systematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings. Objective: To evaluate the performance of a locally deployed 20-billion parameter LLM for automated abstract screening in systematic revi...

5
Clinicians' Rationale for Editing Ambient AI-Drafted Clinical Notes: Persistent Challenges and Implications for Improvement
2026-02-22 health informatics 10.64898/2026.02.20.26346729
#1 (22.2%)

Objective: The use of ambient AI documentation tools is rapidly growing in US hospitals and clinics. Such tools generate the first draft of clinical notes from scribed patient-provider conversations, which clinicians can then review and edit before signing into electronic health records (EHR). Understanding how and why clinicians make modifications to AI-generated drafts is critical to improving AI design and clinical efficiency, yet it has been under-studied. Th...

6
Boards-style benchmarks overestimate prior-chat bias in large language models: a factorial evaluation study
2026-02-14 health informatics 10.64898/2026.02.12.26346164
Top 0.1% (21.7%)

Background: Large language models (LLMs) are increasingly piloted as chat interfaces for chart review and clinical decision support. Although leading models achieve and even exceed physician-level accuracy on exam-style benchmarks such as MedQA, recent perturbation studies show large drops in accuracy after small changes to prompts, distractor content, or answer format. Prior work has not systematically examined how these vulnerabilities unintentionally manifest in clinically realistic settings, i...

7
Sino-US-DrugQA: A Benchmark for Evaluating Large Language Models in Cross-Jurisdictional Pharmaceutical Regulation
2026-02-17 health informatics 10.64898/2026.02.13.26346236
Top 0.1% (21.6%)

Cross-jurisdictional pharmaceutical compliance requires comparative analysis of regulatory requirements across jurisdictions such as the US FDA and China's NMPA. Although large language models (LLMs) are increasingly explored for healthcare-related applications, their performance in cross-jurisdictional regulatory comparison has not been systematically characterized using dedicated benchmarks. This study introduces Sino-US-DrugQA, a bilingual benchmark dataset designed to evaluate LLM performance...

8
Agentic Trial Emulation to Learn Health System-specific Drug Effects At Scale
2026-02-20 health informatics 10.64898/2026.02.19.26346539
Top 0.1% (21.2%)

Objective: Electronic Health Record (EHR)-based trial emulation can support translation of randomized clinical trial (RCT) evidence into practice, yet emulations often diverge from published RCT results. We hypothesized that these discrepancies are structured and learnable properties of a health system's data-generating process, and that autonomous agentic workflows can generate discrepancies at the scale required for cumulative learning. Materials and Methods: We developed an agentic trial emulatio...

9
Representation Before Retrieval: Structured Patient Artifacts Reduce Hallucination in Clinical AI Systems
2026-02-16 health informatics 10.64898/2026.02.13.26346256
Top 0.2% (18.0%)

Background: Large language models show promise for clinical decision support, yet their propensity for hallucination--generating plausible but unsupported claims--poses substantial patient safety risks. Retrieval-augmented generation (RAG) is widely assumed to mitigate this problem by grounding outputs in retrieved documents, but this assumption remains inadequately tested in clinical contexts where information density, temporal complexity, and safety stakes are uniquely high. Methods: We develope...

10
PhenoSS: Phenotype semantic similarity-based approach for rare disease prediction and patient clustering
2026-03-02 health informatics 10.64898/2026.02.26.26347219
Top 0.2% (18.0%)

Objective: Systematic clinical phenotyping using Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization (ranking candidate diseases from HPO for a patient) methods face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We ...

11
Identifying Reasons for ACEI/ARB Non-Use in CKD Using Scalable Clinical NLP with Schema-Guided LLM Augmentation
2026-02-12 health informatics 10.64898/2026.02.10.26346025
Top 0.2% (18.0%)

Importance: Although angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin receptor blockers (ARBs) are recommended for people with chronic kidney disease (CKD), they remain underused. Barriers to adherence, such as adverse effects or patient refusal, are frequently embedded within unstructured clinical narratives and are therefore inaccessible to structured data analytics. Scalable natural language processing (NLP) approaches are needed to identify these barriers and support guideline-...

12
Trustworthy personalized treatment selection: causal effect-trees and calibration in perioperative medicine
2026-03-04 health informatics 10.64898/2026.03.03.26347440
Top 0.3% (17.9%)

Background: Personalized medicine promises to tailor treatments to the individual, but it carries a hidden risk: mistaking statistical noise for actionable clinical insight. Current machine learning approaches often provide predictions, but fail to inform clinicians when those predictions are unreliable. Objective: Develop a deployment-readiness framework that integrates causal inference, interpretable effect-trees, and calibration assessment to distinguish actionable signal from unreliable variati...

13
Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients
2026-02-17 health informatics 10.64898/2026.02.13.26346284
Top 0.3% (17.8%)

Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers. Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular di...

14
Randomized Trial Protocol: Epic Generative AI Chart Summarization Tool to Reduce Ambulatory Provider Cognitive Task Load
2026-02-22 health informatics 10.64898/2026.02.20.26346503
Top 0.3% (17.8%)

Background: EHR documentation and chart review contribute to clinician workload and burnout. To alleviate pre-charting burden, Epic has released a new generative AI chart summarizer tool, which has become widely adopted; however, its impact has not been examined in randomized trials. Objective: To evaluate whether access to an Epic generative AI chart summarization tool reduces cognitive task load among ambulatory providers compared with usual care. Methods: Two-arm, parallel-group randomized contro...

15
Linguistic Effects of Ambient AI on Clinical Documentation: A Matched Pre-Post Study
2026-02-17 health informatics 10.64898/2026.02.16.26346370
Top 0.3% (17.8%)

Ambient intelligence-based systems are increasingly used for clinical documentation. To quantify linguistic differences associated with ambient documentation, we conducted a matched pre-post analysis of 6,026 outpatient clinical notes from Mass General Brigham following implementation of two ambient AI documentation systems (Nuance Dragon Ambient eXperience [DAX] and Abridge). Within-clinician comparisons focused on the History of Present Illness (HPI) and Assessment and Plan (A&P) sections and ...

16
Care Plan Generation for Underserved Patients Using Multi-Agent Language Models: Applying Nash Game Theory to Optimize Multiple Objectives
2026-02-25 health informatics 10.64898/2026.02.23.26346934
Top 0.3% (17.7%)

Background: Clinicians in care management programs are often in low supply relative to patient demand, especially in US Medicaid programs, and must simultaneously address clinical risk, time efficiency, and patients' social needs. Many studies have shown that large language models may assist in their tasks for summarizing patient care, such as in generating care plans; yet these studies also show that different objectives given to agents often conflict and produce problems for safety, efficiency an...

17
Interpretable Fine-tuned Large Language Models Facilitate Making Genetic Test Decisions for Rare Diseases
2026-03-02 health informatics 10.64898/2026.02.26.26347223
Top 0.3% (17.7%)

Clinical decision making often relies on expert judgment guided by established guidelines, which can be challenging to standardize and abstract to implement. For example, selecting between gene panels and whole exome/genome sequencing (WES/WGS) for rare disease diagnosis frequently requires interpretation of evidence-based recommendations from the American College of Medical Genetics and Genomics (ACMG) guideline. Traditional machine learning (ML) models predicting suitable genetic tests often f...

18
Can Machine Learning Algorithms use Contextual Factors to Detect Unwarranted Clinical Variation from Electronic Health Record Encounter Data during the Treatment of Children Diagnosed with Acute Viral Pharyngitis
2026-03-02 health informatics 10.64898/2026.02.23.26346757
Top 0.4% (17.7%)

Rationale, Aims and Objectives: Unwarranted clinical variation (UCV) in patient care often arises from contextual factors and contributes to increased costs, unnecessary treatments, and deviations from evidence-based practice. Detecting UCV is challenging due to the complexity of care decisions. Current approaches rely on centralized data aggregation and mixed-effects regression, which estimate relative variation but cannot detect absolute variation. Moreover, machine learning (ML) methods leverag...

19
An LLM-assisted framework for accelerated and verifiable clinical hypothesis testing from electronic health records
2026-02-12 health informatics 10.64898/2026.02.10.26346008
Top 0.4% (17.6%)

Acquiring insights from electronic health records (EHRs) is slowed by manual analytical workflows that limit scalability and reproducibility. We present LATCH (LLM-Assisted Testing of Clinical Hypotheses), an agentic framework that converts natural language clinical hypotheses into fully auditable analyses on structured EHR data. LATCH integrates LLM-assisted semantic layers with deterministic execution pipelines to automate cohort construction, statistical analysis, and result reporting, while ...

20
Medical concept understanding in large language models is fragmented
2026-03-05 health informatics 10.64898/2026.03.03.26347552
Top 0.4% (17.6%)

Large language models (LLMs) perform strongly across a wide range of medical applications, yet it remains unclear whether such success reflects genuine understanding of medical concepts. We present an ontology-grounded, concept-centered evaluation of medical concept understanding in LLMs. Using 6,252 phenotype concepts from Human Phenotype Ontology, we decompose concept understanding into three core dimensions--concept identity, concept hierarchy, and concept meaning--and design corresponding be...